July,  1968 


Report ESL-R-355


Copy  38 


RAPID PRE-INDEXING BY MACHINE
by 

William  R.  Kampe  II 


The work described in this document was performed as part of Project
Intrex under Research Grant NSFC-472 (Part) awarded to the
Massachusetts Institute of Technology by the National Science
Foundation and the Advanced Research Projects Agency of the Department
of Defense. This grant is designated as M.I.T. DSR Project No. 70054.


Electronic  Systems  Laboratory 
Department  of  Electrical  Engineering 
Massachusetts  Institute  of  Technology 
Cambridge,  Massachusetts  02139 


BIBLIOGRAPHIC DATA SHEET

1. Report No.: ESL-R-355

3. Recipient's Accession No.

4. Title and Subtitle: Rapid Pre-Indexing by Machine

5. Report Date: July, 1968

7. Author(s): William R. Kampe II

8. Performing Organization Rept. No.: ESL-R-355

9. Performing Organization Name and Address

Electronic Systems Laboratory
Massachusetts Institute of Technology
Bldg. 35-311
Cambridge, MA 02139

10. Project/Task/Work Unit No.

11. Contract/Grant No.: NSFC-472 (Part)

12. Sponsoring Organization Name and Address

National Science Foundation
Office of Science Information Service
1800 G Street, N.W.

13. Type of Report & Period Covered: Technical Report

15. Supplementary Notes


16. Abstracts

This report describes the development of a new method of subject indexing by
machine for documents in the Project INTREX catalog. The purpose of the system is
to allow new documents to be placed online quickly in the computer-stored Intrex
catalog. The system that is developed makes use of human-generated subject terms
of existing Intrex documents as a basis for generating index terms for new documents.
The pre-indexing system operates on only the title and abstract of a document in
generating a pre-index for the document. The analysis of documents already
containing human-generated subject indexes consisted of comparing the titles and
abstracts of the documents to their subject indexes. A large dictionary with data
about word usage was obtained from these comparisons. The dictionary served as a
guide for the later pre-indexing of new documents. Three variations of the automatic
pre-indexing method have been developed, tested, and evaluated. Two methods show
promise for operational use in the Intrex system.


17. Key Words and Document Analysis. 17a. Descriptors

Subject indexing
Automatic indexing
Information retrieval
Words (language)

17b. Identifiers/Open-Ended Terms

Project Intrex
Automatic pre-indexing
Free indexing
Word frequency

17c. COSATI Field/Group

18. Availability Statement: Release unlimited

19. Security Class (This Report)

21. No. of Pages: 68




FOREWORD 


Except for minor editorial changes, this report is the thesis
submitted by Mr. William R. Kampe II to the Electrical Engineering
Department, Massachusetts Institute of Technology, in partial
fulfillment of the requirements for the degree of Master of Science. A
few alterations in the original wording have been made throughout
the text in an effort to enhance clarity, and several pages have been
reformatted; otherwise the manuscript remains as submitted.


J.  F.  Reintjes 

Professor  of  Electrical  Engineering 


CONTENTS

CHAPTER I  THE NEED FOR PRE-INDEXING
    The INTREX Environment
    Motivation for Pre-Indexing
    Plan of the Research

CHAPTER II  PHASE I - DATA GATHERING
    The Analysis of Documents Already Having Subject Indexes
    Building a Dictionary
    Possible Pre-Indexing Criteria

CHAPTER III  THE DEVELOPMENT OF PRE-INDEXING SCHEMES
    The Form of the Title and Abstract for Pre-Indexing
    Three Methods of Pre-Indexing
    Method I - Usage Rate Only
    Method II - Usage Rate and Frequency Thresholds
    Method III - Context Decisions
    Testing the Pre-Indexing Methods

CHAPTER IV  RESULTS OF PRE-INDEXING TRIALS
    Subjective Evaluation of the Informational Content of Pre-Indexes
    The Role of Function Words
    The Effect of Varying the Usage Rate Threshold
    The Effect of New Words on Pre-Indexing
    The Completeness-Relevance Trade-off
    The Effect of Dictionary Size

CHAPTER V  CONCLUSIONS AND SUGGESTIONS FOR FURTHER RESEARCH
    Streamlining the Pre-Indexing System
    Applicability of Pre-Indexing to Other Subject Areas
    Suggestions for Further Research
    The Possibility of Using Word Stemming

APPENDIX A  BREAKDOWN OF WORDS IN DICTIONARY BY USAGE RATE AND FREQUENCY

APPENDIX B  RESULTS OF PRE-INDEXING TRIALS
    Method I
    Method II
    Method III

APPENDIX C  PROGRAM LISTINGS AND STRUCTURE
    Phase I
    Phase II

BIBLIOGRAPHY

LIST OF FIGURES

1. The Pre-Indexing System
2. The Target Set of Words for Pre-Indexing (Shaded Area)
3. The Pre-Index
4. Typical Subject Index
5. Typical Title and Abstract
6. TA List and SI List
7. Size of Dictionary as a Function of the Number of Documents Analyzed
8. Set of Words Included in the Dictionary Data Base (Shaded Area)
9. Frequency of Appearance for Words in the Dictionary Based on 80 Document Records
10. Average Word Usage Rate as a Function of Frequency of Appearance
11. Word Inclusion Region for Method I
12. Word Inclusion for Method II
13. Word Zones for Method III
14. Results of Method I for 30 Documents
15. Results of Method II for 30 Documents
16. Results of Method III for 30 Documents
17. Comparison of Average Results for all Methods on 30 Documents


ACKNOWLEDGMENT 


I am greatly indebted to Professor J. Francis Reintjes, thesis
supervisor, for his guidance, suggestions, and critical evaluation of all
stages of this research. I thank Peter Kugel for his continued
encouragement and for his endless patience in discussing new technical
approaches. Thanks are also expressed to Messrs. Robert Kusik,
Richard Marcus, and Alan Benenfeld for their explanations of the
Project INTREX cataloging process.

In addition, I am grateful to Professor Lynwood Bryant, who spent
much time studying the manuscript and making pertinent suggestions
for improvement. I also thank the Publications and Drafting
Personnel of the Electronic Systems Laboratory for their efforts in typing
and assembling the final document.

This research was supported by the Department of Electrical
Engineering and by Project INTREX. Project INTREX is supported
through grants from the Carnegie Corporation and the Council on
Library Resources, Inc., and through Contract NSFC-472 from the
National Science Foundation and Advanced Research Projects Agency.




CHAPTER  I 


THE  NEED  FOR  PRE-INDEXING 

This research investigates the feasibility of using automatic
machine-generated subject indexes as a possible means of
expediting the cataloging process for new documents to be added to the
Project INTREX library. An online library in a fast-changing technical
field needs a method of putting information about a new document into
the library system quickly. A possible method is to generate an
automatic pre-index from only the title and abstract of a new document. The
pre-index is intended to serve as a temporary subject index until human
catalogers can replace the pre-index with a full subject index based on
an examination of the entire new document.

The pre-indexing schemes investigated are based on past
experience gained from a computer analysis of the human-generated indexes
of a set of documents. The experience thus obtained through comparison
of titles and abstracts to corresponding subject indexes is used as a guide
for generating, by machine, a pre-index for a new document to be added
to the system. The pre-indexing scheme being proposed differs from
previous work in automatic indexing in that the pre-indexing scheme
makes extensive use of prior human indexing in an automatic fashion.

The  INTREX  Environment 

Project INTREX (Information Transfer Experiments) seeks to
exploit the multiaccess computer operated in an online mode as a basis
for a machine-stored library. The INTREX collection will consist of
approximately 10,000 documents in selected fields of materials science
and engineering. Each document will be cataloged in depth and the
catalog will be stored in a multiaccess computer. Full text of each
document will be stored on microfilm and made accessible under computer
control. A user of the library will be able to search the catalog by
means of a time-shared computer system. Special remote consoles will
allow him to obtain displays of the full text of documents, or hard copies
if he so desires.




At present, INTREX is still in the developmental stage. There
are no users of the system, as yet. Nevertheless, over 500 documents
have been fully cataloged and placed into the computer data base,
accessible to the INTREX programmers on a time-sharing computer system.
It is this data base which is used for this research effort. The INTREX
programmers have written numerous programs to facilitate access to
the data base.

The cataloging record for each document of the collection
comprises many of the standard items found on the typical catalog cards of
conventional libraries. The title and author, publication date, publisher,
and number of pages are included. In addition, an INTREX record for a
document includes information such as the language in which the
document is written, the abstract, and a subject index. Each particular type
of information is placed in its own field (title, author, publisher, catalog
number, and so forth) in a document record. Most fields in a document
record can be obtained in a purely routine manner. This information
comes with the document and needs only to be transferred to the
document record. The abstract, when included in the record, also arrives
with the document and has been written by the author or an abstracting
agency. Nearly all documents of the INTREX collection include an
abstract. If not, the cataloger substitutes an excerpt.

In a document record, the one field which requires analysis of the
document in order to be generated is the subject index. Since the
subject index is of prime interest in pre-indexing, it will be examined here
in some detail. Of 49 possible descriptive fields being used by INTREX
for a document, the process of generating the subject index field alone
consumes over half of an indexer's time even for short journal articles.
For longer documents, the subject index takes a larger part of the total
time. The purpose of pre-indexing by machine is to relieve the indexers
of the time pressure of writing the subject index for the document.

In generating the subject index for a document, the indexers may
draw upon the material of the full document, including the title, abstract,
and text. The subject index comprises several subject terms which
describe the content of the document. Each subject term may range in
length from a single word to one or more noun phrases containing
several words. To every subject term, the indexer assigns a relevancy
weight, which indicates how the term is used and how important it is in
the document. This weighting system is a key feature of Project INTREX,
since it is hoped that the weights will allow more accurate retrieval of
pertinent documents.

A subject term may be labeled with any of five weights, numbered
from 0 through 4. The numbers 0 and 4 have special significance,
whereas the numbers 1 through 3 indicate specific subject matter of the
document. A weight of 1 indicates that the subject term describes the
primary topic of the document; a weight of 2 designates a secondary topic;
and a weight of 3, a much less central topic. The special weight of 4 is
assigned to terms representing mathematical tools, instrumental tools,
or applications which are cited in the document but which are not central
to it. The weight of 0 designates a generic class to which the subject
matter of a document belongs.
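
The five weights can be written down as a small enumeration. The following is only an illustrative sketch in Python; the class name and member names are hypothetical labels chosen here, not identifiers used by INTREX.

```python
from enum import IntEnum


class TermWeight(IntEnum):
    """Relevancy weights for subject terms (member names are hypothetical)."""
    GENERIC_CLASS = 0        # generic class to which the subject matter belongs
    PRIMARY_TOPIC = 1        # primary topic of the document
    SECONDARY_TOPIC = 2      # secondary topic
    MINOR_TOPIC = 3          # much less central topic
    TOOL_OR_APPLICATION = 4  # tools or applications cited but not central


# Weights 1 through 3 describe specific subject matter; 0 and 4 are special.
topic_weights = [w for w in TermWeight if 1 <= w <= 3]
```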

Knowledge of the indexing behavior of the INTREX catalogers
provides support for the concept of pre-indexing. The catalogers for
Project INTREX are trained and competent in matters of library science.
Nevertheless, they are not trained as subject experts in the area of
materials science, which is the area of the INTREX document collection.
This fact may influence the subject indexes which the indexers generate.
Although in writing a subject index, the indexer may use any word that
she desires in describing the document, the majority of words used will
be those employed by the author of the document. It is natural that an
indexer who is not a subject expert should adopt the language of the
author in indexing. Since the title and abstract of a document usually
provide an excellent indication of the subject matter of that document, a
large number of the subject terms of a subject index are borrowed
directly from the title and abstract of the document.

Thus, the indexers perform derivative indexing. In pure derivative
indexing, the words selected to portray the subject matter of a document
are only those words actually used by the author. Salton at Cornell
University* has experimented with automatic derivative indexing from
title and abstract only. His results are very encouraging to the notion
of pre-indexing. Salton found that automatic derivative indexing from
the title and abstract only is very nearly as good as automatic indexing
from full text.

* Salton, G. (ed.), "Information Storage and Retrieval," Scientific Report
No. ISR-11, Department of Computer Science, Cornell University,
Ithaca, New York, June, 1966.

Motivation for Pre-Indexing

Two factors combine to indicate the possibility of pre-indexing.
One factor indicates that pre-indexing is desirable; the other factor,
that it is feasible. First, the indexers spend a large part of their time
in writing the subject index. It would be desirable if this process could
be bypassed, at least temporarily, in order to make the document
available to INTREX users quickly. Second, Salton's work and an
examination of typical subject indexes reveal that pre-indexing is feasible.
Since the title and abstract will be placed with the document index into
the computer data base, an automatic method for selecting pertinent
words or phrases for a document could generate a pre-index quickly for
the document before the subject indexing is completed.

Plan  of  the  Research 

Basically, the investigation of pre-indexing divides into two rather
distinct phases, as Fig. 1 shows. In Phase I, a number of document
records are automatically analyzed by computer in order to build a
dictionary and data base for later use in actual pre-indexing. Further, the
analysis is intended to provide clues for designing the pre-indexing
methods to be used in Phase II. In Phase II the results of the analysis
are applied to the automatic pre-indexing of new documents. Also in
Phase II, the pre-indexes that are generated are evaluated to obtain a
measure of pre-indexing quality.

Fig. 1 The Pre-Indexing System


CHAPTER  II 


PHASE  I  -  DATA  GATHERING 

To  analyze  profitably  the  titles,  abstracts,  and  subject  indexes  of  a 
large  number  of  documents  in  order  to  gain  experience  and  data  for  later 
pre-indexing,  it  is  necessary  first  to  define  the  goals  of  pre-indexing. 

The  goal  of  automatic  pre-indexing  is  to  extract  from  the  title  and 
abstract  of  a  document  the  set  of  words  that  best  describes  the  contents 
of  the  document.  In  this  research,  it  is  assumed  that  those  words  in  the 
title  and  abstract  of  a  document  that  also  appear  in  the  human-generated 
subject  index  for  the  document  are  the  best  possible  choices  for  the 
words  of  the  pre-index.  Thus,  the  goal  of  pre-indexing  is  to  select  from 
the  title  and  abstract  all  those  words  which  the  human  indexers  will 
eventually  include  in  the  subject  index. 

Of course, the pre-index may not always achieve the goal. Figure 2
shows the relationship of the title-and-abstract words to the words of the
subject index of a document. The target set of words for a pre-index PI
is the set of words TA ∩ SI (read: the set of words common to Title/
Abstract and Subject Index). Words which are eventually included in the
human-generated subject index but are not found in the title and abstract
of the document are ignored in this study (the set of words SI - (SI ∩ TA)).

Two types of errors are possible when words are selected for a
pre-index. The first is the omission of some words that should be
included. As a result of such omissions, the pre-index is incomplete. The
second type of error is the inclusion in the pre-index of words that should
not be included. This second type of error decreases the relevance of
the words in the pre-index. Definitions for two measures of pre-index
quality which indicate how well the two errors are avoided can now be
formulated as follows (see Fig. 3):

Completeness - The percentage of words in TA ∩ SI,
the target pre-index, that are actually included in the
pre-index

Relevance - The percentage of words in PI, the actual
pre-index, that are also included in the target set,
TA ∩ SI




If both completeness and relevance of a pre-index are 100 percent, then
the pre-index exactly matches the target set of words, TA ∩ SI.
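
The two measures can be stated compactly with set operations. The sketch below is an illustration in Python, not the program used in this research; the function names, the handling of empty sets, and the sample words are all assumptions made for the example.

```python
def completeness(pre_index, ta_words, si_words):
    """Percentage of the target set (TA ∩ SI) that the pre-index includes."""
    target = set(ta_words) & set(si_words)
    if not target:
        return 100.0  # assumption: an empty target set is trivially covered
    return 100.0 * len(set(pre_index) & target) / len(target)


def relevance(pre_index, ta_words, si_words):
    """Percentage of pre-index words that belong to the target set (TA ∩ SI)."""
    target = set(ta_words) & set(si_words)
    chosen = set(pre_index)
    if not chosen:
        return 100.0  # assumption: an empty pre-index contains no wrong words
    return 100.0 * len(chosen & target) / len(chosen)


# Hypothetical word sets for a single document.
ta = {"effect", "of", "magnetic", "field", "potassium"}
si = {"magnetic", "field", "potassium", "positron"}
pi = {"magnetic", "potassium", "of"}

# Target set is {magnetic, field, potassium}. The pre-index finds two of the
# three target words, and two of its own three words are in the target set.
print(completeness(pi, ta, si), relevance(pi, ta, si))
```

In this example the omitted word "field" lowers completeness, while the included function word "of" lowers relevance, which mirrors the two types of error described above.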

In this research, the standards defined above are based on
assumptions that must be remembered when trying to determine the true value
of an automatic pre-index. First, the measurements use the
human-generated index as the reference standard of completeness and relevance.
Second, the standards assume that enough information is contained in
the title and abstract of a document to extract a valid pre-index from
only those two sources. Unfortunately, the quality of the human indexing
cannot be evaluated objectively until the INTREX catalog is used
operationally. In the meantime it appears safe to assume that the human
index is a valid index and reference standard. An examination of subject
indexes for several documents shows that most of the words of the
subject index appear in the title and abstract of a document. Furthermore,
the work of Salton at Cornell indicates that use of only the title and
abstract of a document is a good procedure for generating a derivative
index by machine. There will be occasion to question the validity of the
two measures of pre-indexing later, but for the time being they appear
to provide a reasonable basis for an investigation of pre-indexing
methods.

Now what remains to be determined is a statistical basis for a
pre-indexing system. The next section describes the analysis of many
documents which already have a subject index, as well as a title and
abstract, so that some decision rules may be devised for the selection
of words for a pre-index.

The  Analysis  of  Documents  Already  Having  Subject  Indexes 

The  purpose  of  analyzing  document  records  which  already  contain 
subject  indexes  is  twofold.  First,  the  analysis  should  yield  information 
for  the  formulation  of  possible  decision  rules  for  later  auto-pre-indexing, 
and  second,  the  analysis  will  result  in  a  dictionary  of  all  words  that  have 
appeared  in  the  titles  and  abstracts  of  all  the  documents  analyzed.  Such 
a  dictionary  is  employed  in  the  pre-indexing  of  new  documents. 

In this research, 80 documents already having subject indexes, as
well as titles and abstracts, were analyzed by computer according to the
techniques formulated below. The subject index of each document
comprises a list of subject terms. A term may include one word or
several words combined into a grammatical phrase. A representation
of a typical subject index is shown in Fig. 4, and representations of
the title and abstract are given in Fig. 5. In order to compare the
words in the title and abstract with the words in the subject index, the
analysis system in the Phase-I part of the pre-indexing process must
convert title and abstract into one list of words and the subject index
into another. The converted lists are shown in Fig. 6. The list for
the title and abstract words is called TA; the list for subject index
words, SI. The purpose of these lists is to allow simple counts of the
usage of words in the title, abstract, and subject index.

Fig. 4 Typical Subject Index

Fig. 5 Typical Title and Abstract

Fig. 6 TA List and SI List (the TA column lists the title words followed
by the abstract words; the SI column lists the words of each subject term
together with its weight)

In both the title and the abstract sections of TA, each different
word is entered only once, regardless of the number of tokens of that
word that actually occur in, for example, the abstract of a document.
For instance, the word "of" may occur four times in the abstract of a
document. However, "of" will appear only once in the abstract section
of TA. If the word "of" also occurs in the title of the document, then
"of" will also be entered once in the title section of TA. In SI, each
subject term is considered separately. A subject term is reduced to a
list of words in a manner similar to that for the listing of the title. The
word lists for all subject terms are placed sequentially in SI, and each
subject term fills its own block of SI. Thus, the word "of" may occur
more than once in the entire SI list, but only once in each of several
subject terms. Each subject term is also marked with its weight number.
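
The conversion into TA and SI lists can be sketched as follows. This is a hedged illustration in Python rather than the Phase-I program itself; the tokenization rule (lower-casing and splitting on non-alphanumeric characters), the function names, and the sample title, abstract, and subject terms are assumptions introduced for the example.

```python
import re


def words(text):
    """Split text into lower-case word tokens (tokenization is an assumption)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def unique_in_order(tokens):
    """Keep the first occurrence of each word, preserving order."""
    return list(dict.fromkeys(tokens))


def build_ta(title, abstract):
    """TA list: each distinct word once per section (title, abstract)."""
    return {
        "title": unique_in_order(words(title)),
        "abstract": unique_in_order(words(abstract)),
    }


def build_si(subject_terms):
    """SI list: one block per subject term, marked with the term's weight."""
    return [(weight, unique_in_order(words(term)))
            for term, weight in subject_terms]


# Hypothetical document record.
ta = build_ta("Effect of the field of force",
              "Of the field, of the force")
si = build_si([("magnetic field orientation", 2),
               ("annihilation of positrons", 1)])
print(ta)
print(si)
```

A word such as "of" is entered once in the title section and once in the abstract section of TA, however many tokens occur, while in SI it may recur across blocks but never within one block.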

Two definitions help in understanding the counting procedures used
in analyzing documents. First, an appearance of a word is the
occurrence of that word in either the title or abstract. A word may have only
one title appearance and one abstract appearance per document. Note
that title appearances are counted separately from abstract appearances.

Second, given that a word appears in a document title or abstract, the
number of times that the word occurs in SI of the same document is
defined as the usage of the word. A word may have a usage count greater
than one since the word may be used in several subject terms. The
usage count of a word having title appearances is kept separately from
the usage count of the same word also having an abstract appearance.
This distinction has been made because the significance of a word
appearing in the abstract is different from the significance of the same word
appearing in the title of a document.
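
The counting rules for one document can be illustrated with a short Python sketch. It assumes that the TA sections and SI blocks are already available as word lists; the function name, the record layout, and the sample words are hypothetical.

```python
from collections import Counter


def count_document(title_words, abstract_words, si_blocks):
    """Appearance and usage counts for one document, kept per section.

    si_blocks is a list of (weight, words) pairs, one per subject term.
    """
    # Usage: number of subject terms in which a word occurs (once per term).
    usage_in_si = Counter()
    for _weight, term_words in si_blocks:
        usage_in_si.update(set(term_words))

    counts = {}
    for section, section_words in (("title", title_words),
                                   ("abstract", abstract_words)):
        # A word has at most one appearance per section per document.
        counts[section] = {
            word: {"appearances": 1, "usage": usage_in_si[word]}
            for word in set(section_words)
        }
    return counts


# Hypothetical document.
counts = count_document(
    ["effect", "of", "magnetic", "field"],
    ["the", "magnetic", "field", "of", "potassium"],
    [(1, ["magnetic", "field"]),
     (2, ["magnetic", "field", "orientation"]),
     (0, ["potassium"])],
)
print(counts["title"]["magnetic"])
```

Here "magnetic" has one title appearance and a usage of two, because it occurs in two subject terms; "orientation" is not counted at all, since it never appears in the title or abstract.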

The usage for each word is also broken down into usage by the
weight of the subject terms in which it occurs. Thus, a word may have
a total usage of three, with a usage of two in weight-2 terms and a
usage of one in weight-4 terms. This distinction of usage by weight is
made to determine if the weight usage of a word provides any clues for
selecting words for a pre-index. Results indicate that weight usage is
not particularly significant. Only the aggregate usage is meaningful.

At first glance it may appear somewhat arbitrary to permit a word
to have only one appearance in a title or abstract. However, allowing
multiple appearances would create a difficult problem in counting the
usage for a particular word. With multiple appearances, it would be
impossible to determine just which appearance is responsible for the
usage of that word in the subject index. Moreover, the important
statistic is not so much how many tokens of a word appear in the title
or abstract, but rather, given that the word appears at all, how likely
is that word to be used in the subject index.

Building  a  Dictionary 

In order to provide a future basis for judging the significance of
words for pre-indexing, the data that are collected for appearances and
usage of the words of the title and abstract must be stored in a
dictionary. As each new document record is analyzed, the words from the
title and abstract of the document, along with the appearance and usage
data for those words, are added to the dictionary. Thus, the dictionary
will be a list of all words that have appeared in a title or abstract to
date, along with cumulative data about appearances and usage. In the
data portion of the dictionary, title data and abstract data are separated
in the same manner that the data were collected, although the dictionary
includes each word only once.
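
The accumulation of per-document counts into the dictionary can be sketched as below. This Python fragment is an illustration under assumed data layouts, not the original implementation; the entry structure, the function names, and the sample counts are hypothetical.

```python
from collections import defaultdict


def new_entry():
    """One dictionary entry: separate cumulative title and abstract data."""
    return {
        "title": {"appearances": 0, "usage": 0},
        "abstract": {"appearances": 0, "usage": 0},
    }


def add_document(dictionary, doc_counts):
    """Add one document's appearance and usage counts to the dictionary."""
    for section in ("title", "abstract"):
        for word, c in doc_counts.get(section, {}).items():
            entry = dictionary[word][section]
            entry["appearances"] += c["appearances"]
            entry["usage"] += c["usage"]


# Each word is stored once; its title and abstract data are kept apart.
dictionary = defaultdict(new_entry)

# Hypothetical counts for two analyzed documents.
doc1 = {
    "title": {"magnetic": {"appearances": 1, "usage": 2}},
    "abstract": {"magnetic": {"appearances": 1, "usage": 2},
                 "of": {"appearances": 1, "usage": 0}},
}
doc2 = {
    "title": {},
    "abstract": {"magnetic": {"appearances": 1, "usage": 1},
                 "of": {"appearances": 1, "usage": 1}},
}
for doc in (doc1, doc2):
    add_document(dictionary, doc)
print(dictionary["magnetic"])
```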

As stated previously, 80 document records were analyzed
automatically by computer in the manner just described. Then the data
contained in the dictionary were inspected in order to learn about the
statistical basis for the usage of words in the subject index. Note that
at this stage, the data that have been collected merely represent the
behavior of the human indexers.

One item of interest is the number of words that the dictionary
contains, since the number of words in the dictionary may influence the
usefulness of the dictionary for pre-indexing. Figure 7 shows the
number of words in the dictionary as a function of the number of document
records analyzed. The size of the dictionary is growing less rapidly in
the region of 80 documents than it was in the first few documents of the
collection. New words were added to the dictionary at the rate of
approximately 35 per document during the analysis of the first 20
documents. After 80 documents, the dictionary grew at the rate of 24 new
words per document. Such a growth curve indicates that the dictionary
includes many common words after only a small sample of documents
has been analyzed, but that new words will be encountered frequently in
new documents. Later, in the design of a pre-indexing scheme for new
documents, it will be necessary to provide for those words that are not
included in the dictionary, but appear in the title or abstract of a new
document.

Fig. 7 Size of Dictionary as a Function of the Number of Documents Analyzed


The dictionary does not include all words that have been used in
subject index terms. The dictionary is intended to provide information
on the selection of words from the title and abstract. Figure 8 shows
again the relationship of the set of title and abstract words to the set of
subject index words. All the words in the shaded portion are filed in
the dictionary. They include all title and abstract words. The words
in the unshaded portion are not included. Although the unshaded set of
words may well be good words for inclusion in a subject index for future
documents, it is not the purpose of the dictionary merely to list such
words. Since the words represented by the unshaded portion of Fig. 8
are not directly connected with the words of the title and abstract, the
words of the unshaded portion are omitted.

Fig. 8 Set of Words Included in the Dictionary Data Base (Shaded Area)

Possible  Pre-Indexing  Criteria 

One  possible  clue  for  the  selection  of  a  word  for  a  pre-index  is 
the  frequency  of  appearance  of  the  word  in  document  records  already 
analyzed.  The  frequency  of  appearance  of  a  word  refers  to  the  percent¬ 
age  of  the  document  records  analyzed  in  which  the  word  has  made  an 
appearance.  Thus  if  a  word  has  appeared  in  40  of  80  abstracts,  then 
the  word  has  a  50  percent  frequency  of  appearance  in  abstracts.  The 
same  word  may  have  a  frequency  of  appearance  of  only  ten  percent  in 
titles.  Figure  9  shows  the  number  of  words  from  the  dictionary  in 


Percentile of     Number of words in percentile
Frequency         Title       Abstract

100                 0            2
90-99               0            3
80-89               0            2
70-79               0            1
60-69               1            3
50-59               0            3
40-49               0            3
30-39               3            6
20-29               0           16
10-19               6           71
0-10              410         2009

Fig. 9 Frequency of Appearance for Words in the Dictionary
Based on 80 Document Records




various  percentiles  of  frequency  for  both  title  appearances  and  abstract 
appearances.  High-frequency  words  tend  to  be  function  words  --  mostly 
prepositions  and  articles.  Such  words  convey  little  about  the  content  of 
a  document.  Titles  tend  not  to  include  high-frequency  words. 

Another possible criterion for pre-indexing is the usage rate of
words in the dictionary. Given that a word has appeared in a title (or
abstract), the usage rate of the word is the ratio of the number of
times it is used in subject indexes to the number of its appearances in
titles (or abstracts). That is,

usage  rate  =  (cumulative  usage)/(no.  of  appearances). 

The usage rate, then, is just the average usage of the word per
appearance. A word has two usage rates. One usage rate is based on its
appearances  in  titles;  the  other,  on  its  appearances  in  abstracts.  A 
high  usage  rate  for  a  word  means  that  whenever  the  word  appears  in  a 
title  (or  abstract)  it  has  a  high  probability  of  being  used  in  the  subject 
index. 
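
The two quantities defined above can be sketched in a few lines. This is
only an illustration of the definitions; the function names are
hypothetical and do not come from the report, and the usage count of 30
in the last line is an invented figure.

```python
def frequency_of_appearance(appearances: int, documents: int) -> float:
    """Percentage of the analyzed document records in which the word appears."""
    return 100.0 * appearances / documents


def usage_rate(cumulative_usage: int, appearances: int) -> float:
    """Average number of subject-index uses per appearance of the word."""
    return cumulative_usage / appearances


# The example from the text: a word seen in 40 of 80 abstracts.
print(frequency_of_appearance(40, 80))
# A hypothetical word used 30 times in subject indexes over those 40 appearances.
print(usage_rate(30, 40))
```

Each word would carry two such pairs of figures, one computed from its
title appearances and one from its abstract appearances.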

Thus, there are two main types of information available for words
in the dictionary -- usage rate and frequency. Figure 10 shows the usage
rate of the words in the dictionary as a function of frequency of
appearance. Since the usage of a word can be higher than the number of
appearances, the usage rate has been truncated at a maximum value of 1.
The dashed lines in Fig. 10 are the average usage rates as a function of
frequency. Each word in the dictionary is actually represented by a point
somewhere on the two-dimensional graphs. (The points that are shown
in the figure are merely dummy points for purposes of illustration.)

The  data  represented  in  Fig.  10  are  an  aggregate  of  the  usage  data 
for  the  different  weight  usage  counts.  Although  the  usage-count  data 
that  is  stored  in  the  dictionary  is  separated  by  weight  usage,  the  usage 
rates  shown  in  Fig.  10  are  determined  from  the  total  usage  counts  for 
all  weights.  A  graph  showing  usage  rate  by  the  individual  weights  yields 
similar results. With title words, however, most of the usage occurs
in weight-1 subject terms. Title words have an extremely high
probability of being used in subject terms -- over 90 percent. The only
peculiarity of abstract word use is that high-frequency words are seldom
used in terms of weight (0) or weight (4), indicating that terms of such


Fig. 10 Average Word Usage Rate as a Function of Frequency of Appearance
(a) Title Words. Vertical axis: usage rate (total usage / no. of
appearances), 0.00 to 1.00. Horizontal axis: frequency of appearance in
percent (no. of appearances / no. of documents), 0 to 100. The dashed
line marks the average usage rate.



weights  tend  to  be  quite  specific.  High-frequency  "function"  words, 
such  as  prepositions,  are  used  in  phrases  of  other  weights  in  order  to 
make  such  terms  readable  noun  phrases,  for  example,  "positron 
annihilation  in  potassium". 

There is a useful characteristic of word usage that does not show
in Fig. 10, but is clearly demonstrated in Appendix A. Very few
low-frequency words have usage rates at intermediate levels near the
average usage rate. The vast majority of low-frequency words are either
used very seldom or very often.

Thus,  the  dictionary  contains  a  large  number  of  words  that  the 
indexers  have  used  in  the  subject  indexes  of  documents.  However,  the 
dictionary  also  contains  a  large  number  of  words  that  the  indexers  have 
consistently  failed  to  use  in  subject  indexes.  Such  words  include  verbs, 
which  never  appear  in  subject  terms.  Hence,  the  dictionary  contains 
information  that  will  be  useful  when  dictionary  words  are  encountered 
in pre-indexing new documents. When a word encountered in a new
document is found in the dictionary, the data on usage rates will indicate
whether the human indexers have considered the word as useful for
indexing. Of course, words that are not in the dictionary will be
discovered in new documents. Rules for handling both known and new words
will be necessary in generating a pre-index.

The next chapter describes the actual methods to be used for
pre-indexing.


CHAPTER  III 


THE  DEVELOPMENT  OF  PRE-INDEXING  SCHEMES 

All the automatic pre-indexes that have been evaluated have a
very simple form. Basically, the pre-index is a list of words which is
intended to represent the content of a document. All the pre-indexes
generated are the result of simple word-by-word selection procedures
that choose words from the titles and abstracts of documents. Only in
one of the three methods examined are words considered in context. A
random sample of 30 documents with both titles and abstracts has been
used for the testing of pre-indexing.

The  Form  of  the  Title  and  Abstract  for  Pre-Indexing 

All pre-indexing methods tested in this research operate on a list
of the title and abstract words in the document to be indexed. In
preparing a title and abstract of a document for pre-indexing, we have
made a slight change from the form of the TA list as described in the
preceding chapter. For pre-indexing, a single list of title and abstract
words in a document is prepared, but now the list gives all words in
exact order of their occurrence, including multiple occurrences. Such
a listing preserves information on context, which may be useful for
pre-indexing.
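
A minimal sketch of such an ordered word list is given below. The report
does not say how words were separated or normalized, so the tokenizing
rule here (lower-casing, keeping internal hyphens, dropping punctuation)
and the function name are assumptions made only for illustration.

```python
import re


def ordered_word_list(title: str, abstract: str) -> list:
    """Return title words followed by abstract words, in order of occurrence.

    Repeated words are kept, so the position and count of every occurrence
    are preserved. The tokenizing rule is an assumption, not the report's.
    """
    text = f"{title} {abstract}".lower()
    return re.findall(r"[a-z0-9]+(?:-[a-z0-9]+)*", text)


words = ordered_word_list(
    "Dynamic Pulse Hysteresis",
    "The pulse is quasi-static.",
)
print(words)
```

Because the list keeps duplicates and order, later steps can both count
occurrences of a word and inspect its immediate neighbors.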

Three  Methods  of  Pre-Indexing 

Three different algorithms for pre-indexing from a title and
abstract word list have been developed. The three methods have some
common features. The similarities in the methods lie in the handling
of title words and in the handling of abstract words not found in the
dictionary. The primary differences in the methods are in the way that
the dictionary data are used.

For all methods, the words of the title of the document are included
in the pre-index. The data contained in the dictionary indicate that
including the title words is a very sound decision rule: title words
have at least a 90 percent chance of being used in the subject index of
a document; hence title words should definitely be included in a
pre-index.

All methods of pre-indexing face the problem of new words,
that is, the occurrence in title or abstract of a word which is not
contained in the dictionary. All new words, clearly, will be
low-frequency words, since the new words have not appeared in any
documents previously. The usage-rate averages, as graphed in Fig. 10 of
Chapter II, show that a low-frequency word in an abstract has only about
a 30 percent chance of being used in the human-generated subject index
of the document. Thus, when a new word appears in the abstract of a
document, the usage data alone give no valid reason to include it in the
pre-index. Nevertheless, if all new words are excluded, then many of the
words that belong in the pre-index will be omitted. As noted previously,
even when a dictionary comprising the words from 80 documents is
used, over 20 new words are encountered in the title and abstract of
a new document. Therefore, it has been decided that in all three
pre-indexing methods a new word will be included if it appears at least
twice in the title and abstract. A manual inspection of several
document records indicates that if the word is used at least twice, then
there is a good chance that the word represents something important
to the content of the document. New words which occur only once in
the title and abstract appear much more likely to be only filler and
not central to the content of the document. Actual attempts at
pre-indexing support the notion that a single occurrence of a new
word is insufficient for the word to be included in the pre-index.
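
The two rules shared by all three methods (always keep title words; keep
a new word only if it occurs at least twice in the title and abstract)
might be sketched as follows. The function name and the representation of
the dictionary as a plain set of known words are assumptions for
illustration only.

```python
from collections import Counter


def shared_selections(title_words, abstract_words, dictionary_words):
    """Words that every method includes regardless of dictionary data.

    title_words, abstract_words: lists of words in order of occurrence.
    dictionary_words: set of words already filed in the dictionary
    (a simplified stand-in for the real dictionary records).
    """
    counts = Counter(title_words + abstract_words)
    selected = set(title_words)  # rule 1: all title words are included
    for word, occurrences in counts.items():
        # rule 2: a new word needs at least two occurrences
        if word not in dictionary_words and occurrences >= 2:
            selected.add(word)
    return selected


picked = shared_selections(
    ["pulse", "hysteresis"],
    ["the", "ferrite", "core", "ferrite", "risetime"],
    {"the", "pulse"},
)
print(sorted(picked))
```

In this toy input "ferrite" is new and occurs twice, so it is kept, while
the single-occurrence new words "core" and "risetime" are not.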

The three methods differ mostly in the handling of words that
are contained in the dictionary. The primary information available
about words in the dictionary is the frequency of appearance of the
word and the usage rate of the word. For example, the dictionary
contains the information that the word "the" has appeared in 80
abstracts and has a cumulative usage count from abstracts of 27 in
weight-1 terms, 13 in weight-2 terms, 46 in weight-3 terms, 22 in
weight-4 terms, but only 1 in weight-0 terms. The three methods
use this information in different ways. However, all usage data are
aggregated over all subject-term weights. No use is made of the
breakdown by individual weights. Thus only the aggregated usage
count of 109 is retrieved from the dictionary. From the number of
appearances in abstracts and the total usage from abstracts, it is
then known that the word "the" has a frequency of 100 percent in
abstracts and a usage rate greater than 1.0.
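
The arithmetic for "the" can be checked with a short sketch. The record
layout below (a small class holding abstract appearances and per-weight
usage counts) is hypothetical; only the numbers are taken from the text.

```python
from dataclasses import dataclass

DOCUMENTS_ANALYZED = 80  # size of the document base behind the dictionary


@dataclass
class DictionaryEntry:
    """Hypothetical record layout for one word's abstract data."""
    word: str
    abstract_appearances: int
    usage_by_weight: dict  # subject-term weight -> cumulative usage count

    def total_usage(self) -> int:
        # Usage is aggregated over all subject-term weights.
        return sum(self.usage_by_weight.values())

    def frequency_percent(self) -> float:
        return 100.0 * self.abstract_appearances / DOCUMENTS_ANALYZED

    def usage_rate(self) -> float:
        return self.total_usage() / self.abstract_appearances


the = DictionaryEntry("the", 80, {0: 1, 1: 27, 2: 13, 3: 46, 4: 22})
print(the.total_usage(), the.frequency_percent(), the.usage_rate())
```

The aggregated count is 109 and the rate is 109/80, which exceeds 1 and
would be plotted at the truncated value of 1 in Fig. 10.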

Method  I  -  Usage  Rate  Only 

The first method considers only the usage rate for a word which
is found in the dictionary. Words with a high usage rate are selected
for the pre-index. Words with a low usage rate are excluded from the
pre-index. Figure 11 illustrates method I. Words with usage rates
falling in the shaded area are included in the pre-index for the
document. The level of usage rate, R, that is required for a word
to be included in the pre-index is tested at three different threshold
levels to determine what effect the usage-rate threshold has on the
completeness and relevance of a pre-index. The three threshold
levels of the usage rate, R, that have been tested are 0.25, 0.50,
and 0.75.
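
As a rough sketch of the method I decision for dictionary words, the
selection reduces to one comparison against R. The usage rates below are
invented for illustration, and treating a rate exactly equal to R as
"included" is an assumption; the report does not state how ties are
handled.

```python
def method_1_select(usage_rates: dict, r: float) -> list:
    """Dictionary words whose usage rate reaches the threshold R."""
    return sorted(word for word, rate in usage_rates.items() if rate >= r)


# Hypothetical usage rates, not taken from the report's dictionary.
hypothetical_rates = {
    "hysteresis": 0.90,
    "field": 0.60,
    "applied": 0.40,
    "report": 0.05,
}

for r in (0.25, 0.50, 0.75):  # the three threshold levels tested
    print(r, method_1_select(hypothetical_rates, r))
```

Raising R can only shrink the selected set, which is why the threshold
trades completeness against relevance.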

Method  II  -  Usage  Rate  and  Frequency  Thresholds 

In the second method, both the usage rate of a word and its
frequency of appearance are considered. Since very high-frequency words
tend to carry little information, such words may reasonably be excluded
from the pre-index. Therefore, in method II the pre-index will include
only low-frequency, information-bearing words. Figure 12 illustrates the
selection criteria for method II. Words whose frequency and usage rate
fall in the shaded area are included in the pre-index. All other
dictionary words are excluded. In tests of this method, the usage-rate
threshold, R, has been held constant at 0.5 so that the effect of varying
the frequency threshold, F, may be observed. The frequency threshold
has been tested at 25 percent, 50 percent, and 75 percent.
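
The method II test for a dictionary word can be sketched as a pair of
comparisons. The function name, the default values, and the handling of
values exactly on a threshold are assumptions; the sample figures in the
calls are invented.

```python
def method_2_includes(frequency_percent: float, usage_rate: float,
                      f: float = 75.0, r: float = 0.5) -> bool:
    """True if a dictionary word falls in the inclusion region of method II.

    The word must be below the frequency threshold F (in percent) and at
    or above the usage-rate threshold R.
    """
    return frequency_percent < f and usage_rate >= r


# A function word: used often by indexers, but too frequent to keep.
print(method_2_includes(100.0, 1.0))
# A low-frequency word with a high usage rate.
print(method_2_includes(29.0, 0.8))
# A low-frequency word the indexers have seldom used.
print(method_2_includes(5.0, 0.1))
```

Compared with method I, the only added condition is the frequency test,
which is what removes words such as "the" and "of".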

Method  III  -  Context  Decisions 

The third method of pre-indexing is somewhat more involved.
In the other two methods, a word is either selected or not selected
for the pre-index purely on the basis of the dictionary data for that
word only. The third method of pre-indexing recognizes an "ambiguous"
class of words, for which the pre-indexing method considers
neighboring words before making a decision to include or exclude the
ambiguous word. Figure 13 shows the two thresholds on usage rate, R1
and R2, and the frequency threshold, F, that are used in the third
method of pre-indexing. Only words with a frequency below the frequency
threshold F are considered as candidates for the pre-index. Words whose
usage rates fall in the zone between the two usage-rate thresholds R1
and R2 are considered to be in the ambiguous zone. Weak words,
which have usage rates lower than R1, are immediately excluded from
the pre-index. Strong words, whose frequency and usage rates fall
in the cross-hatched region above R2 in Fig. 13, are immediately
included in the pre-index. Words with usage rates between R1 and R2
(the shaded region of Fig. 13) are marked for later decision. If
such an ambiguous word neighbors a strong word on either side,
then the ambiguous word is included in the pre-index. However, if
the ambiguous word is surrounded only by weak words, then the
ambiguous word is excluded from the pre-index. This method was
devised purely as an experiment to aid in the handling of words whose
usage rate is at an intermediate level and of new words whose usage
rate is unknown. For such words there is no sound basis for a
selection decision, so context is considered in this method. In essence,
this method recognizes simple phrases that are delimited by weak
words such as articles, prepositions, and verbs. For example, from
the phrase "...the dynamic pulse hysteresis in...", method III will
select the words "dynamic pulse hysteresis."

Fig. 11 Word Inclusion Region for Method I

Fig. 12 Word Inclusion for Method II

Fig. 13 Word Zones for Method III

In method III as in the other two methods, if a new word (a word
not contained in the dictionary) occurs at least twice in the title and
abstract of a document, then that word is included in the pre-index. If
such a new word occurs only once, it is considered to be an ambiguous
word and is included only if it has a strong neighbor. This
decision rule was adopted because a manual inspection of many abstracts
indicated that although many meaningful words may appear only once
in the title and abstract of a document, such words usually seem to
occur next to other meaningful words. By treating single-occurrence
new words as ambiguous rather than weak, fewer meaningful words are
omitted from a pre-index. Later pre-indexing trials showed that the
primary difference between method II and method III lies in the
treatment of new words with single occurrences.
Recall that such words are excluded from a method II pre-index.
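
The zone-and-neighbor rule of method III can be sketched as below. This
is an illustration under stated assumptions, not the report's program:
the dictionary is modeled as a mapping from word to a (frequency percent,
usage rate) pair, all values are invented, words at or above F are
treated as weak, new words with two or more occurrences are treated as
strong, only the immediately adjacent words count as neighbors, and the
always-include rule for title words is left out.

```python
from collections import Counter


def classify(word, occurrences, dictionary, f, r1, r2):
    """Label one word as 'strong', 'weak', or 'ambiguous'."""
    if word not in dictionary:
        # New words: two or more occurrences are kept outright (assumed
        # to act as strong); a single occurrence is ambiguous.
        return "strong" if occurrences >= 2 else "ambiguous"
    frequency_percent, usage_rate = dictionary[word]
    if frequency_percent >= f or usage_rate < r1:
        return "weak"
    if usage_rate >= r2:
        return "strong"
    return "ambiguous"


def method_3(words, dictionary, f=75.0, r1=0.3, r2=0.6):
    """Select words from an ordered title-and-abstract word list."""
    counts = Counter(words)
    labels = [classify(w, counts[w], dictionary, f, r1, r2) for w in words]
    selected = []
    for i, (word, label) in enumerate(zip(words, labels)):
        left = labels[i - 1] if i > 0 else "weak"
        right = labels[i + 1] if i + 1 < len(labels) else "weak"
        if label == "strong" or (label == "ambiguous" and "strong" in (left, right)):
            selected.append(word)
    return selected


# Hypothetical dictionary data: word -> (frequency percent, usage rate).
toy_dictionary = {
    "the": (100.0, 1.0),
    "dynamic": (5.0, 0.8),
    "pulse": (10.0, 0.45),
    "hysteresis": (8.0, 0.9),
    "in": (95.0, 0.9),
}
print(method_3(["the", "dynamic", "pulse", "hysteresis", "in"], toy_dictionary))
print(method_3(["the", "gadget", "in"], toy_dictionary))
```

With these invented values "pulse" lies in the ambiguous zone and is kept
because it sits beside strong words, while the single-occurrence new word
"gadget" is dropped because both of its neighbors are weak.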

In tests of method III the frequency threshold was held constant
at 75 percent so that the effect of varying the ambiguous usage zone
could be observed. For testing, the ambiguous zone was set at three
different ranges -- 0.2 to 0.5, 0.3 to 0.5, and 0.3 to 0.6.

Testing the Pre-Indexing Methods

Thirty documents were automatically pre-indexed in order to
test the above methods. To facilitate experimentation, the automatic
pre-indexing system was developed to pre-index each document by
all three methods with only one pass on a document. To provide
further experimental efficiency, the pre-indexing system also
pre-indexed under three different parameter sets for each method. In
addition, the pre-indexing system compared the different trial
pre-indexes of each document to the human-generated subject index
in order to yield immediate values of completeness and relevance for
the pre-indexes. Completeness and relevance are the standards of
quality as defined in the previous chapter. The next chapter describes
the results of the tests of pre-indexing.


CHAPTER  IV 


RESULTS OF PRE-INDEXING TRIALS

To test the pre-indexing methods that were described in the
preceding chapter, 30 documents having titles and abstracts were
used. For each of the documents, nine different pre-indexes were
generated. The nine pre-indexes are the result of using the three
methods, each with three parameter sets. The testing demonstrated
that the methods cause significant differences in pre-indexing results,
although changing parameters within a method has little effect on the
quality of a pre-index as measured by the standards of completeness
and relevance.

A simple way to view the results for a pre-indexing method is
to plot the completeness and relevance for each document pre-index
as a point on a graph. The closer that a document pre-index is to
100 percent in both completeness and relevance, the better the
pre-index, at least in a primitive way. The results of pre-indexing
trials indicate that there is cause to question the measure of
completeness as it has been defined previously. However, for the time
being, Figs. 14, 15, and 16 show the plots of completeness and relevance
for pre-indexes by methods I, II, and III respectively. Although three
different parameter sets were tested for each method, a representative
parameter set has been selected to illustrate the results of each
method. Figure 17 summarizes the results for all three parameter
sets for each of the three methods. In Fig. 17 only the average
results of completeness and relevance are plotted. Appendix B gives
the results of all pre-indexing trials in tabular form.

Subjective Evaluation of the Informational Content of Pre-Indexes

One of the first questions that comes to mind is, "Does an
automatic pre-index appear to describe the subject content of the
document?" An inspection of pre-indexes for several documents indicates
that, indeed, the pre-indexes contain reasonably accurate subject
matter from the title and abstract of documents. For example, from


Fig. 14 Results of Method I for 30 Documents
(Plot of completeness versus relevance, in percent. Parameter used:
usage-rate threshold 0.50. Average: completeness [illegible],
relevance 61 percent.)

Fig. 15 Results of Method II for 30 Documents
(Plot of completeness versus relevance, in percent. Parameters used:
usage-rate threshold 0.50, frequency threshold 75 percent. Average:
completeness 53 percent, relevance [first digit illegible]2 percent.)

Fig. 16 Results of Method III for 30 Documents
(Plot of completeness versus relevance, in percent. Parameters used:
usage-rate thresholds R1 0.3, R2 0.6; frequency threshold 75 percent.
Average: completeness 59 percent, relevance 57 percent.)

Fig. 17 Comparison of Average Results for all Methods on 30 Documents
(Plot of average completeness versus average relevance, in percent, for
methods I, II, and III. A, B, and C indicate the different parameter
sets used. The values of the parameters are included in Appendix B.)



the title and abstract of an article about "Dynamic Pulse Hysteresis
in Magnetic Devices", the following is a partial list of terms in a
pre-index generated by method III.

    instantaneous hysteresis
    magnetic
    risetime
    an applied pulse field
    dynamic pulse hysteresis
    magnetic devices
    quasi-static hysteresis
    magnetic damping phenomena by
    extended
    with

The pre-indexing method overlooked some words that are pertinent,
such as "ferrite core", but in general, the pre-index contains a
significant number of the important words from the title and abstract.
Only a few of the pre-index words are not pertinent to the topic,
such as "an", "by", "with", and "extended". In general the
pre-indexing methods are very good at eliminating verbs from pre-indexes.
The word "extended" is included because it has a high usage rate
from prior usage by the human indexers, who probably used it as an
adjective rather than a verb. Overall, however, the pre-indexing
system appears to do an effective job.

The  Role  of  Function  Words 

The results from method I and method II, as illustrated in
Figs. 14 and 15, apparently show that method I is far superior to
method II. Both methods yield very nearly the same measured
relevance; however, method I gives a completeness measure a full 33
percent higher than method II. Yet the only difference in the two
methods is that method II eliminates high-frequency words from a
pre-index, such as "in", "to", "for", "the", and "of". Both methods
include all title words in the pre-index. Both methods have the same
usage-rate threshold for words found in the dictionary. Moreover,
both methods treat new words in exactly the same way. But method II
excludes numerous function words by excluding high-frequency words.
Such function words have little informational value. In fact the two
most frequent words that still carry information are "magnetic" and
"field", with frequencies just under 30 percent. Words of such low
frequency have not been eliminated from pre-indexes. However,
human indexers include high-frequency words in subject index terms
in order to make noun phrases. (In INTREX each word of a noun
phrase is being placed in an inverted file of subject words and its
position within the phrase is being recorded. The purpose of this
procedure is to allow users of the INTREX system to state their requests
in terms of noun phrases, if they wish. INTREX plans to test and
evaluate the merit of this kind of capability.) When method II
eliminates such high-frequency function words from a pre-index, the
completeness of that pre-index is reduced. Nevertheless, the
informational value of that pre-index is unchanged, since the excluded
function words carry no information. The comparison of method I
with method II shows that fully 30 percent of all word occurrences
that are eventually found in the human-generated subject index are
merely informationless function words. Thus, the comparison of the
two methods shows that the measures of completeness and relevance
unfortunately do not fully indicate the quality of a pre-index.
However, if the problem of high-frequency words is kept in mind, then the
measures of completeness and relevance still give at least a feeling
for the relative results of pre-indexing trials.

The Effect of Varying the Usage-Rate Threshold

Within both methods I and III the usage-rate thresholds are
varied to determine if the exact setting of the usage-rate threshold
affects the pre-index significantly. In method III, where
high-frequency words are excluded from the pre-index, the settings of the
usage-rate thresholds have very little effect on the completeness and
relevance of the pre-indexes. This is to be expected, since only five
percent of all low-frequency words found in the dictionary have usage
rates that fall in the intermediate range from 0.20 to 0.60, the range
in which the thresholds were varied. Thus, the particular setting of
the usage-rate threshold has negligible effect on the pre-indexes
generated by method III. However, in method I the setting of the
simple usage-rate threshold seems to have a much greater effect on
the pre-indexes generated by this method. This is not surprising
when it is remembered that method I includes high-frequency words
in the pre-index; their inclusion depends on the usage rates of such
high-frequency words. Since the usage rates of several of the
higher-frequency words fall in the range of trial variation of the
usage-rate threshold for method I, the completeness and relevance are
more sensitive to the usage-rate threshold.

The Effect of New Words on Pre-Indexing

Examination of Fig. 17 shows a noticeable difference in the
pre-indexing results between method II and method III. However, from
the preceding discussion, it appears that there should be little
difference because of the different ways of setting the usage-rate
thresholds for the two methods, since both methods eliminate
high-frequency words. Also, recall that both methods include new words
that appear at least twice. The difference in the results obtained from
the two methods arises from the different way of handling new words that
occur only once in the abstract of a document. In method III, such
words are considered as ambiguous words and may be included in the
pre-index under the proper circumstances. By including such
single-occurrence new words, method III has a higher completeness than
method II, since the inclusion of any word in a pre-index can only
help the measured completeness of that pre-index. On the other hand,
some of the new words that method III selects for inclusion in the
pre-index are not used in the human-generated subject index; hence the
relevance of method III pre-indexes is lowered slightly.

In the early experimental stages of this research, all methods
of pre-indexing included all new words regardless of the number of
occurrences of that word in the title and abstract of the document.
However, changing the decision rule to require at least two
occurrences of a word for definite inclusion in a pre-index appears to
have significantly increased the quality of the pre-indexes. Under all
methods of pre-indexing, the relevance of pre-indexes increased by
approximately eight percent, while the completeness suffered by only
two to three percent when the decision rule was changed to require at
least two occurrences.




The Completeness-Relevance Trade-off

Further examination of Fig. 17 clearly shows that within a
pre-indexing method, relevance must be sacrificed in order to gain
completeness. Moreover, comparing method II and method III, which
differ most significantly in policy toward new words, one also finds
that the same trade-off is made between methods. Such is to be
expected. Adding any word to a pre-index can never reduce
completeness. In fact if every word of the title and abstract of a
document is included in the pre-index, then the completeness for that
pre-index will be 100 percent. But the relevance of that pre-index will
be approximately 35 percent, since only that percentage of words in the
title and abstract of a document are used in the average
human-generated subject index. (The analysis of document records in
Phase I showed that approximately 35 percent of the words of the
title and abstract eventually appear in the human-generated subject
index.) When words are eliminated from a pre-index in order to
increase relevance, there is the statistical probability that some of the
eliminated words are actually necessary to maintain completeness.
Hence, completeness must be sacrificed if relevance is to be raised.
Moreover, as is found by comparing method I with method II, a full
30 percent of completeness can be sacrificed without damaging the
informational content of a pre-index merely by removing the
high-frequency words from the pre-index.
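
The trade-off can be illustrated with a small sketch. The formal
definitions of completeness and relevance are given in an earlier chapter
and are not reproduced here, so the formulas below are assumptions
inferred from the statements above: relevance is taken as the share of
pre-index words that appear in the human subject index, and completeness
as the share of subject-index words available in the title and abstract
that the pre-index captures. Distinct words are counted rather than
occurrences, and all word lists are invented.

```python
def completeness_and_relevance(pre_index, subject_index, title_abstract):
    """Return (completeness, relevance) in percent under the assumed definitions."""
    pre = set(pre_index)
    subject = set(subject_index)
    reachable = subject & set(title_abstract)  # subject words a pre-index could find
    hits = pre & reachable
    completeness = 100.0 * len(hits) / len(reachable) if reachable else 0.0
    relevance = 100.0 * len(pre & subject) / len(pre) if pre else 0.0
    return completeness, relevance


# Invented example data.
ta_words = ["the", "dynamic", "pulse", "hysteresis", "in",
            "magnetic", "devices", "is", "reported"]
subject_words = ["pulse", "hysteresis", "magnetic", "ferrite"]

# Taking every title and abstract word maximizes completeness only.
print(completeness_and_relevance(ta_words, subject_words, ta_words))
# A shorter pre-index gives up completeness to gain relevance.
print(completeness_and_relevance(["pulse", "hysteresis", "the"],
                                 subject_words, ta_words))
```

Under these assumed definitions, adding a word can never lower
completeness, while any added word absent from the subject index lowers
relevance.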

The  Effect  of  Dictionary  Size 

Even with a dictionary containing all the words from the titles
and abstracts of 80 documents, many new words are encountered in
the pre-indexing of new documents. An examination of the curve of
dictionary growth in Fig. 7 indicates that even if the dictionary were
obtained from a much larger document base, new words in titles and
abstracts would continue to have an important role in pre-indexes.
However, the dictionary does contain a good collection of 40 very
common words, which appear in over 30 percent of all document
abstracts. Also, an examination of several pre-indexes reveals that
the dictionary contains most of the verbs which are encountered in
typical abstracts. The three methods of pre-indexing are very good
at recognizing and excluding verbs from pre-indexes, for example,
forms of "to be", "to report", "to have", and "to discuss". In fact
a large portion of the selection decisions that the pre-indexing
methods must make about words found in the dictionary result in the
exclusion of those words, rather than the inclusion of the words. In
addition, most decisions must be made about common words, since
such words appear most frequently. Therefore, the dictionary used
in this research is of sufficient size to provide a reasonable test of
the postulated pre-indexing methods. Only a very much larger
dictionary would alter the proportion of new words encountered, and it
is doubtful that the results would be substantially different.


CHAPTER  V 


CONCLUSIONS  AND  SUGGESTIONS  FOR  FURTHER  RESEARCH 

The primary conclusion to be drawn from this research is that
automatic pre-indexing by machine is feasible with the application of
simple techniques. Furthermore, the implementation of the pre-indexing
methods developed here can be accomplished with computational
efficiency if the pre-indexing system is designed as an operational
system rather than as an experimental system.

The first step in the design of an operational pre-indexing
system is to select a suitable pre-indexing method. It is the opinion
of this author that method I is not so useful for pre-indexing as are
the other two methods. Recall that method I allows the inclusion of
high-frequency function words in the pre-index for a document,
whereas the other two methods exclude them. In a pre-index which is
a list of descriptive words, function words have no place.

The  remaining  pre-indexing  methods,  II  and  III,  present  a 
trade-off  between  completeness  and  relevance.  Method  III,  which  in¬ 
cludes  a  greater  percentage  of  the  new  words  that  it  encounters  than 
does  method  II,  tends  to  be  slightly  more  complete,  but  somewhat  less 
relevant. 

However, it is necessary before one judges the relative merits
of methods II and III to look more closely at the figures for completeness
for both of these methods. The graph of Fig. 17 shows method II
with an average completeness of 53 percent; method III, 59 percent.
From the prior examination of method I, recall that approximately 30
percent of the completeness measured for a pre-index is lost if high-
frequency words are removed. That particular 30 percent of completeness
can be safely ignored in considering informational content,
since the high-frequency words removed in methods II and III are all
merely function words. Thus in terms of residual informational content,
the true average completeness of methods II and III probably
lies in the range of 80 percent to 90 percent, after making a correction
for the extraneous 30 percent. Thus, both methods II and III appear to
contain a substantial portion of the informational content of the title
and abstract of a document.
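
The text does not say how the 30 percent correction is applied, so it can be read in two ways. The following sketch is an illustration only, not part of the original programs; it works out both readings from the averages quoted above (53 and 59 percent). The function names are hypothetical.

```python
# Illustrative arithmetic only; the function names are hypothetical.
FUNCTION_WORD_SHARE = 30  # percent of completeness attributed to function words


def additive_correction(measured):
    """Add back the 30 percentage points attributed to function words."""
    return measured + FUNCTION_WORD_SHARE


def proportional_correction(measured):
    """Rescale against the 70 percent of completeness that carries information."""
    return round(measured * 100 / (100 - FUNCTION_WORD_SHARE))


method_ii, method_iii = 53, 59
print(additive_correction(method_ii), additive_correction(method_iii))
print(proportional_correction(method_ii), proportional_correction(method_iii))
```

The additive reading gives 83 and 89 percent, which falls inside the quoted range of 80 to 90 percent; the proportional reading gives roughly 76 and 84 percent.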

Nevertheless, method III, which includes more of the new words,
would probably be the best pre-indexing method for an operational system
because of its greater completeness. The price that must be paid for
the greater completeness is lessened relevance, but the larger pre-index
generated by method III should not create an undue burden.

Another  distinction  between  method  II  and  method  III  is  that 
method  III  recognizes  and  provides  special  processing  for  words 
having  intermediate  usage  rates.  Because  a  small  dictionary  was 
used  in  the  pre -indexing  tests,  this  difference  in  methods  caused  only 
very  small  differences  in  results  from  method  II;  with  a  greatly  ex¬ 
panded  dictionary,  method  III  would  probably  become  even  more  use¬ 
ful. 

Streamlining  the  Pre -Indexing  System 

Much  can  be  done  to  streamline  the  pre -indexing  system  for 
either  operational  use  or  further  experimentation.  Most  of  the  pos¬ 
sible  improvements  should  be  made  in  the  organization  of  the  dictio¬ 
nary  files.  Presently,  the  files  contain  an  excessive  amount  of  data 
that were collected in the automatic analysis of documents. At the
time the data were collected, there was no way to determine just what
data would be useful. From an examination of the accumulated data,
however,  it  appears  that  the  breakdown  of  usage  information  by 
subject -term  weights  can  be  discarded,  and  only  an  aggregated  form 
of  usage  data  retained.  Such  a  condensation  would  reduce  the  size  of 
the  files  by  35  percent.  The  use  of  sorting  techniques  in  organizing 
the  dictionary  would  improve  the  efficiency  of  the  pre -indexing 
system.  Proper  organization  would  allow  the  use  of  faster  search 
techniques  in  dictionary  lookups.  Such  organization  of  the  dictio¬ 
nary  and  implementation  of  improved  search  would  be  necessary  for 
an  operational  system,  or  even  for  extensive  additional  research.  In 
a  pioneer  experimental  system  with  a  relatively  small  dictionary, 
such  refinements  are  not  justified. 
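
As a minimal sketch of what sorted organization buys, assume the dictionary is held in memory as a word list sorted alphabetically, with a parallel list of usage data. A lookup then becomes a binary search. The data layout, the names, and the sample counts are all hypothetical; the experimental system described here did not work this way.

```python
import bisect


def build_sorted_dictionary(entries):
    """Return parallel lists of words and their data, sorted by word."""
    ordered = sorted(entries.items())
    words = [word for word, _ in ordered]
    data = [datum for _, datum in ordered]
    return words, data


def lookup(words, data, word):
    """Binary search: about log2(n) comparisons instead of a linear scan."""
    i = bisect.bisect_left(words, word)
    if i < len(words) and words[i] == word:
        return data[i]
    return None  # a "new word" not yet in the dictionary


# Hypothetical (appearances, uses) counts, for illustration only.
words, data = build_sorted_dictionary(
    {"magnetic": (12, 9), "film": (7, 5), "the": (80, 0)}
)
print(lookup(words, data, "film"))
print(lookup(words, data, "ferrite"))
```

For a dictionary of about 2200 words, a binary search needs at most 12 comparisons per lookup.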


Additional streamlining can be performed on some of the programming.
Several operations that at first appeared necessary for
a preliminary experimental system are now clearly superfluous.

Applicability  of  Pre -Indexing  to  other  Subject  Areas 

All  of  the  documents  used  for  this  pre -indexing  research  fall  in 
the  area  of  materials  science  and  engineering.  Thus,  the  dictionary 
contains  the  special  vocabulary  used  by  scientists  and  engineers  of 
that  field.  A  question  arises  as  to  whether  the  dictionary  and  pre- 
indexing  system  developed  in  this  research  are  useful  in  other  subject 
areas.  Unfortunately,  there  is  no  document  collection  from  another 
field  available  for  direct  experimentation. 

However,  the  nature  of  the  pre -indexing  system  would  probably 
allow  it  to  be  effective  in  another  technical  field.  The  pre -indexing 
system  does  not  work  solely  by  choosing  recognizable  words  from  a 
title  and  abstract;  the  system  also  eliminates  many  common  words. 
In  fact,  most  word  decisions  made  by  the  system  are  to  eliminate  a 
word rather than to include it. Many of the words very common to
materials science are function words and therefore are also very common
in other fields. The pre-indexing system is good at eliminating
such words, since those words are the verbs and prepositions. Hence
the dictionary would still be at least partially effective in some other
field.

Suggestions  for  Further  Research 

One  item  of  obvious  interest  is  the  effect  of  dictionary  size  on 
the  results  of  pre -indexing  trials.  The  dictionary  used  in  these  tests 
contained  approximately  2200  words.  A  much  larger  dictionary  may 
affect  the  results  in  two  ways.  First,  the  number  of  new  words  en¬ 
countered  in  pre -indexing  may  be  reduced.  Second,  the  distribution 
of  the  usage  rates  of  the  known  words  may  change,  so  that  the  setting 
of  the  usage -rate  thresholds  in  the  different  methods  may  become 
more important. For example, in a dictionary built from 80 documents,
many of the words contained in the dictionary have appeared
in documents only once. Hence, such a word can have a usage rate of
only either 1.0 or 0.0. But if the dictionary were to be based on the
analysis of 400 documents, then perhaps several of the low-frequency
words would have a sufficient number of appearances so that their
usage rates could fall around 0.5. An analysis of the data contained in
the  dictionary  as  it  grew  to  its  present  size  indicates  that  any  signifi¬ 
cant  change  in  the  distributions  of  the  usage -rate  data  is  far  in  the 
future,  if  in  the  future  at  all.  A  comparison  of  a  dictionary  based  on 
35  documents  with  a  later  dictionary  based  on  80  documents  showed 
only  a  very  minor  change  in  the  distribution  of  usage  rates.  In  both 
cases,  the  usage  rates  fell  primarily  at  the  two  extremes. 
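
The clustering at the two extremes follows from the arithmetic. The sketch below assumes, for illustration, that a word's usage rate is the fraction of its appearances in which it was also used in a subject term; the function name is hypothetical.

```python
from fractions import Fraction


def possible_usage_rates(appearances):
    """All usage rates a word can take after the given number of appearances,
    assuming usage rate = (times used in a subject term) / (appearances)."""
    return [Fraction(used, appearances) for used in range(appearances + 1)]


for n in (1, 2, 4):
    print(n, [float(rate) for rate in possible_usage_rates(n)])
```

Under this assumption a word seen once can only score 0.0 or 1.0. A rate near 0.5 becomes possible only once the word has appeared at least twice.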

The  effect  of  reducing  the  number  of  new  words  is  also  hard  to 
predict.  Again  it  is  very  probable  that  a  very  much  larger  dictionary 
is  necessary  to  substantially  affect  the  number  of  new  words  en¬ 
countered  in  the  title  and  abstract  of  a  document.  Figure  7  shows 
that  the  number  of  new  words  per  new  document  is  dropping  slowly 
after  80  documents.  In  fact,  a  base  of  400  documents  might  be  neces¬ 
sary  to  decrease  the  number  of  new  words  per  document  from  over 
20  at  present  to  about  10. 

A  shortage  of  computer  time  prevented  tests  with  an  expanded 
data  base.  However,  first  a  streamlining  of  the  pre -indexing  system 
to  a  more  operational  form,  followed  by  testing  with  a  larger  dictio¬ 
nary,  would  no  doubt  be  worthwhile  as  a  prelude  to  making  such  a  pre- 
indexing  system  operational. 

The  Possibility  of  Using  Word  Stemming 

The value of stemming in a pre-indexing system was not tested
during these experiments. However, stemming may have two very
useful effects on a pre-indexing system. One immediate effect is
to  reduce  the  size  of  the  dictionary  that  is  stored  as  the  data  base. 
Many  different  forms  of  a  root  with  various  endings  are  presently 
stored  separately  in  the  dictionary.  However,  by  stemming,  all 
words  that  have  the  same  root  can  be  lumped  together  in  one  dictio¬ 
nary  entry.  For  example,  the  entries  for  "magnetic"  and  "magne¬ 
tism"  can  be  merged.  Such  a  consolidation  of  data  significantly  re¬ 
duces  storage  requirements. 

Moreover, the stemming of words also decreases the number of
new words encountered in a new document. With stemming, a new
ending on a known root would be considered as merely another
instance of the known root, rather than as a completely new and
unknown word.
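
A toy sketch of both effects follows. Stemming was not implemented in this research, so the suffix list, the minimum root length, and the (appearances, uses) data layout are all assumptions made for illustration. This is a naive suffix stripper, not an established stemming algorithm.

```python
# Hypothetical suffix list, chosen only to cover the example in the text.
SUFFIXES = ("ism", "ic", "s")


def stem(word, min_root=4):
    """Strip the longest matching suffix, keeping at least min_root letters."""
    w = word.lower()
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        if w.endswith(suffix) and len(w) - len(suffix) >= min_root:
            return w[: -len(suffix)]
    return w


def merge_by_stem(dictionary):
    """Combine (appearances, uses) counts of all words sharing a root."""
    merged = {}
    for word, (appearances, uses) in dictionary.items():
        root = stem(word)
        a, u = merged.get(root, (0, 0))
        merged[root] = (a + appearances, u + uses)
    return merged


# Hypothetical counts, for illustration only.
dictionary = {"magnetic": (5, 4), "magnetism": (2, 1), "film": (3, 3)}
print(merge_by_stem(dictionary))
print(stem("films") in merge_by_stem(dictionary))
```

In this sketch the two "magnet" entries collapse into one, and the previously unseen form "films" is recognized as an instance of the known root "film".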


APPENDIX  A 


BREAKDOWN  OF  WORDS  IN  DICTIONARY  BY 
USAGE  RATE  AND  FREQUENCY 

(Based  on  words  from  80  documents) 

(a) Title Words

                     Frequency of Appearance (percent)
Usage Rate      0-9  10-19  20-29  30-39  40-49  50-59  60-69  70-79  80-89  90-99   100
≥ 1.00          368      6      -      3      -      -      1      -      -
0.90 - 0.99       -      -
0.80 - 0.89
0.70 - 0.79       1
0.60 - 0.69       1      -      -      -      -      -      -      -      -
0.50 - 0.59       3
0.40 - 0.49
0.30 - 0.39
0.20 - 0.29
0.10 - 0.19
0.00 - 0.09      37


(b) Abstract Words

                     Frequency of Appearance (percent)
Usage Rate      0-9  10-19  20-29  30-39  40-49  50-59  60-69  70-79  80-89  90-99   100
≥ 1.00          660     20      4      -      1      -      1      -      -      1     2
0.90 - 0.99       -      -      -      -      -      -      -      -      -      -     -
0.80 - 0.89       4      2      -      -      -      -      -      -      -      -     -
0.70 - 0.79       9      3      -      -      -      -      -      -      -      -     -
0.60 - 0.69      22      4      1      1      -      -      -      -      -      -     -
0.50 - 0.59      52      3      2      -      -      -      -      -      -      2     -
0.40 - 0.49       6      4      -      -      1      -      1      1      -      -     -
0.30 - 0.39      25      5      -      -      -      -      -      -      -      -     -
0.20 - 0.29      15     10      -      -      1      -      -      -      -      -     -
0.10 - 0.19       6      1      -      -      -      -      -      -      -      -     -
0.00 - 0.09    1210     19      9      5      -      3      1      -      1      -     -




APPENDIX  B 


RESULTS  OF  PRE -INDEXING  TRIALS 

C  -  Completeness  in  percent 
R  -  Relevance  in  percent 


Method I

                        A           B           C
Usage rate
threshold:            0.25        0.50        0.75

   Doc No.       C     R     C     R     C     R
 1     300      82    72    80    76    74    84
 2     302      93    80    89    86    73    88
 3     304      87    86    84    87    85    89
 4     305      89    92    87    94    82   100
 5     364      78    88    78    91    75    96
 6     383      97    46    97    48    94    58
 7     401      96    48    96    51    96    59
 8     402     100    19   100    23   100    33
 9     415      90    41    87    45    87    57
10     437      72    63    70    68    76    73
11     606      88    69    88    73    86    87
12     619      91    79    91    81    87    86
13     620      87    43    87    51    87    58
14     621      78    55    78    58    64    57
15     622      78    82    72    83    71    92
16     623      75    29    70    30    70    36
17     625      84    43    82    56    77    52
18     719      91    61    84    53    80    6)
19     721      83    56    79    61    76    69
20     722      90    41    89    43    59    37
21     723      94    52    94    53    94    64
22     724      94    71    90    77    81    79
23     818      97    55    87    65    81    78
24     905      93    73    93    74    65    74
25     907      81    43    81    45    63    49
26     908      95    37    91    70    78    77
27     909      89    45    77    43    73    53
28     911      91    29    91    34    91    44
29    1102      92    65    84    68    80    80
30    1106      88    75    88    75    79    90
   Average      88    55    86    61    79    69

Method II

                        A           B           C
Frequency
threshold:             25%         50%         75%
Usage rate
threshold:            0.50        0.50        0.50

   Doc No.       C     R     C     R     C     R
 1     300      40    74    42    72    42    66
 2     302      34    74    39    74    42    75
 3     304      45    89    45    86    51    85
 4     305      53    95    61    96    66    96
 5     364      42    87    45    85    47    86
 6     383      60    68    71    71    71    61
 7     401      57    61    57    58    57    53
 8     402      36    20    57    27    57    24
 9     415      65    72    65    54    65    72
10     437      32    63    36    59    36    58
11     606      46    76    51    73    51    66
12     619      38    69    44    70    44    67
13     620      58    64    65    51    65    57
14     621      41    64    41    63    49    64
15     622      34    96    37    93    40    90
16     623      55    58    70    58    70    50
17     625      41    43    43    42    45    42
18     719      40    47    46    49    50    50
19     721      47    79    55    80    55    72
20     722      35    40    38    38    40    37
21     723      65    69    74    66    74    62
22     724      30    58    41    62    43    64
23     818      42    62    44    60    44    55
24     905      65    87    65    79    80    82
25     907      38    48    38    39    38    36
26     908      38    62    38    56    38    53
27     909      48    63    50    50    50    45
28     911      66    58    66    45    66    42
29    1102      57    80    59    74    59    64
30    1106      63    88    67    89    67    84
   Average      47    67    52    64    53    62



Method III

                        A           B           C
Frequency
threshold:             75%         75%         75%
"Ambiguous"
zone:              0.2 to 0.5  0.3 to 0.5  0.3 to 0.6

   Doc No.       C     R     C     R     C     R
 1     300      47    57    46    58    46    60
 2     302      49    70    48    71    48    73
 3     304      60    79    60    80    60    80
 4     305      74    97    74    97    74    97
 5     364      53    77    53    77    53    77
 6     383      74    54    74    55    74    58
 7     401      59    41    59    43    59    44
 8     402      57    17    57    17    57    17
 9     415      69    43    69    45    69    45
10     437      49    57    49    57    49    57
11     606      54    60    54    60    51    64
12     619      47    58    47    69    47    53
13     620      74    50    71    49    71    49
14     621      56    58    53    57    48    56
15     622      49    85    47    86    46    89
16     623      90    43    90    44    83    27
17     625      52    36    52    36    48    36
18     719      60    40    60    41    60    42
19     721      58    62    58    64    58    64
20     722      46    31    46    33    41    32
21     723      77    60    77    59    74    59
22     724      51    55    48    63    43    61
23     818      50    50    49    50    47    52
24     905      85    72    85    74    85    74
25     907      49    35    47    36    47    39
26     908      43    47    43    48    43    51
27     909      66    45    66    47    66    48
28     911      71    32    71    33    71    37
29    1102      67    59    67    60    67    61
30    1106      71    74    71    74    71    74
   Average      60    55    60    56    59    57


APPENDIX  C 


PROGRAM  STRUCTURE  AND  LISTINGS 

The  programs  for  both  Phase  I  and  Phase  II  of  the  pre -indexing 
system  make  extensive  use  of  pointers  in  order  to  make  data  avail¬ 
able  to  all  subroutines.  The  programs  are  written  in  the  AED  lan¬ 
guage,  which  has  convenient  facilities  for  handling  pointers. 

The  programs  are  listed  and  further  described  by  phase.  The 
Phase  I  programs  are  used  to  analyze  document  subject  indexes  and 
generate  the  dictionary  files.  Phase  II  actually  generates  pre  - 
indexes  by  the  three  different  methods. 

All  programs  are  designed  to  run  on  the  Compatible  Time- 
Sharing  Computer  System  at  M.I.T.  (CTSS). 

PHASE  I 

The  main  program,  MAIN1,  sets  up  all  the  data  handling  arrays. 
MAIN1  makes  room  for  a  list  of  title  and  abstract  words,  a  list  of 
subject  index  words,  and  an  array  for  storing  appearance  and  usage 
data. Pointers to the various arrays are stored in a directory array,
PTRS. Thus most major subroutine calls need only the argument PT,
which is a pointer to the array PTRS. PTRS also contains other data
that are useful to the subroutines. A list of the important entries in
PTRS follows:

 0 - Location of TA, the array for title and abstract words.
 1 - Location of SI, the array for subject index terms.
 2 - Location of D, the array for appearance and usage data.
 3 - TL, the length of the title section of TA. This is entered by a subroutine.
 4 - AL, the length of the abstract section of TA.
 5 - A scratch location sometimes used to pass arguments in subroutine calls.
 6 - The number of the current document being operated on.
 8 - Another scratch location.
 9 - The word length in a dictionary lookup.
10 - The file number in a dictionary lookup.
11 - Another scratch location.
12 - Location of BOOK FILE.
13 - Location of dictionary file in use.
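
The directory layout above can be mirrored in a modern language as a set of named indices into one shared list. This is a sketch for orientation only. The original programs are in AED, and the symbolic names below are hypothetical labels for the slot numbers listed above; slot 7 is not described in the list and is left unnamed.

```python
from enum import IntEnum


class Ptrs(IntEnum):
    """Hypothetical names for the PTRS slots listed in the text."""
    TA = 0                  # location of title and abstract words
    SI = 1                  # location of subject index terms
    D = 2                   # location of appearance and usage data
    TL = 3                  # length of the title section of TA
    AL = 4                  # length of the abstract section of TA
    SCRATCH_A = 5
    DOC_NUMBER = 6
    SCRATCH_B = 8
    LOOKUP_WORD_LENGTH = 9
    LOOKUP_FILE_NUMBER = 10
    SCRATCH_C = 11
    BOOK_FILE = 12
    DICT_FILE = 13


def record_title_length(ptrs, title_length):
    """A subroutine needs only the directory to publish a result."""
    ptrs[Ptrs.TL] = title_length


ptrs = [0] * 14
record_title_length(ptrs, 5)
print(ptrs[Ptrs.TL])
```

Passing the single directory to every subroutine, as the text describes for the argument PT, keeps call signatures short at the cost of making each routine depend on the slot numbering.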

Three  major  arrays  are  used  in  the  analysis  of  a  document 
record.  These  arrays  are  TA,  SI,  and  D.  TA  contains  the  words  of 
the  title  and  abstract.  SI  contains  the  subject  terms.  D  is  filled  with 
the  usage  and  appearance  data  for  the  words  in  TA. 

The  TA  Array 

The  words  in  the  TA  array  are  stored  four  characters  per  com¬ 
puter  word,  left  justified  and  blank -padded.  Thus  a  six -character 
word  will  fill  two  computer  words  and  have  two  blank  characters  in 
the  final  two  character  positions.  The  words  stored  in  TA  are 
separated  by  a  computer  word  of  four  blank  ASCII  characters. 

The length of a stored word is defined as the number of computer
words it fills, including the fence of blanks which separates it from
the next word. Thus, a four-character word such as "film" has a
length of 2: one computer word for the characters plus one computer
word for the fence of blanks. The words "magnetic" and "field" have
the same length of 3.

The  length  of  an  array  of  words  is  defined  as  the  sum  of  the 
lengths  of  all  individual  words  in  the  array.  The  length  of  an  array 
does  not  specify  the  number  of  words  in  an  array. 
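
The packing and length rules can be checked with a short sketch. It models each computer word as a four-character Python string instead of a packed machine word, and the function names are hypothetical.

```python
import math

CHARS_PER_WORD = 4
FENCE = " " * CHARS_PER_WORD  # one computer word of blanks


def pack_word(word):
    """Left-justify and blank-pad a word into 4-character cells, then add the fence."""
    cells_needed = math.ceil(len(word) / CHARS_PER_WORD)
    padded = word.ljust(cells_needed * CHARS_PER_WORD)
    cells = [padded[i:i + CHARS_PER_WORD]
             for i in range(0, len(padded), CHARS_PER_WORD)]
    return cells + [FENCE]


def stored_length(word):
    """Length as defined in the text: computer words filled, fence included."""
    return len(pack_word(word))


def array_length(words):
    """Length of an array of words: the sum of the individual stored lengths."""
    return sum(stored_length(w) for w in words)


print(pack_word("field"))
print(stored_length("film"), stored_length("magnetic"), stored_length("field"))
print(array_length(["magnetic", "field"]))
```

The last line illustrates why the array length does not give the word count: the two words "magnetic" and "field" yield an array length of 6.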

The  SI  Array 

The  first  computer  word  of  the  SI  array  is  a  header  which  tells 
the  number  of  subject  terms  in  SI.  The  header  is  followed  by  the 
subject  terms.  Each  subject  term  also  has  a  one  computer  word 
header  which  contains  the  length  of  the  subject  term  in  the  decrement 
and  the  weight  of  the  subject  term  in  the  address. 
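
A sketch of the subject-term header follows. It assumes the conventional 36-bit word of the IBM 7090/7094 family on which CTSS ran, with a 15-bit address field in the low-order bits and a 15-bit decrement field starting 18 bits above it. The function names are hypothetical.

```python
FIELD_MASK = 0o77777   # 15-bit field
DECREMENT_SHIFT = 18   # decrement sits 18 bits above the address field


def make_header(length, weight):
    """Pack a subject term's length (decrement) and weight (address)."""
    if not (0 <= length <= FIELD_MASK and 0 <= weight <= FIELD_MASK):
        raise ValueError("field value does not fit in 15 bits")
    return (length << DECREMENT_SHIFT) | weight


def header_length(header):
    return (header >> DECREMENT_SHIFT) & FIELD_MASK


def header_weight(header):
    return header & FIELD_MASK


header = make_header(length=3, weight=4)
print(header_length(header), header_weight(header))
```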

The  D  Array 

The D array is used for recording the usage and appearance data
for the words in the TA array. Three computer words in D are required
for each language word in TA. For the nth word in TA the
contents of D are as follows (the position of data within a word is
indicated by the octal mask):

D(3N)    - Weight 0 usage from title, 7C29
         - Weight 1 usage from title, 1777C
         - Weight 2 usage from title, 777C20
         - Weight 4 usage from title, 1777C10

D(3N+1)  - Weight 0 usage from abstract, 7C29
         - Weight 1 usage from abstract, 1777C
         - Weight 2 usage from abstract, 777C20
         - Weight 4 usage from abstract, 1777C10

D(3N+2)  - Weight 3 usage from title, 777C27
         - Weight 3 usage from abstract, 777C9
         - Title appearance, 777C18
         - Abstract appearance, 777C

Data  are  placed  in  the  D  array  in  this  particular  manner  so  that 
it  is  formatted  for  direct  addition  to  the  word  data  already  contained 
in  the  dictionary  files. 
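
The "direct addition" property can be illustrated with a sketch. It rests on an assumption about the mask notation: that a mask written mmmCnn denotes the octal value mmm shifted left by nn bit positions, and that a bare trailing C means no shift. Under that reading the four fields of D(3N+2) exactly tile a 36-bit word. The field names and helper functions are hypothetical.

```python
def parse_mask(text):
    """Assumed reading of 'mmmCnn': octal mmm shifted left by nn bits."""
    digits, _, shift_text = text.partition("C")
    shift = int(shift_text) if shift_text else 0
    return int(digits, 8) << shift, shift


# Fields of D(3N+2), with hypothetical names.
D3N2_MASKS = {
    "title_w3": "777C27",
    "abstract_w3": "777C9",
    "title_app": "777C18",
    "abstract_app": "777C",
}


def pack(values, masks=D3N2_MASKS):
    """Place each named count into its field of one computer word."""
    word = 0
    for name, mask_text in masks.items():
        mask, shift = parse_mask(mask_text)
        field = values.get(name, 0) << shift
        if field & ~mask:
            raise OverflowError(name)
        word |= field
    return word


def unpack(word, masks=D3N2_MASKS):
    """Recover the named counts from one computer word."""
    result = {}
    for name, mask_text in masks.items():
        mask, shift = parse_mask(mask_text)
        result[name] = (word & mask) >> shift
    return result


stored = pack({"title_app": 4, "abstract_app": 9, "abstract_w3": 2})
new_doc = pack({"title_app": 1, "abstract_app": 1, "abstract_w3": 1})
print(unpack(stored + new_doc))  # one integer addition updates every field
```

A single addition updates all counts at once, provided no individual field overflows into its neighbor.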

Dictionary  Files 

The  dictionary  files  contain  all  words  encountered  in  titles  and 
abstracts,  as  well  as  the  cumulative  usage  and  appearance  data  for 
each  word.  The  files  are  sorted  only  by  word  length.  Each  file  is 
two  tracks  (864  computer  words)  long.  Thus,  there  is  a  series  of 
files for each possible word length. A file is specified by two names,
such as ...M03 ...N04. The first name indicates the length of words
contained in the file, in this case 3. The second name indicates the
number of the file in the series of files for the given word length. In
this case the file is the fourth of the dictionary files having a word
length of three.

For  storage,  each  word  in  a  file  is  followed  directly  by  three 
computer  words  containing  data  about  appearances  and  usage.  The 
data  words  have  exactly  the  format  described  for  data  storage  array  D. 


The  Book  File 

With  the  dictionary  files  as  described  above,  a  bookkeeping 
system  is  necessary  to  keep  track  of  the  number  of  files  in  each  word 
length  series,  and  to  monitor  the  number  of  words  stored  in  the  last 
number  file  of  each  series.  The  Book  File  maintains  this  vital  infor¬ 
mation.  If  a  file  word  length  is  M,  then  the  number  of  files  in  the 
series  is  stored  in  location  2M  and  the  number  of  words  contained  in 
the  last  file  of  the  series  is  contained  in  location  (2M+1)  of  the  Book 
File. 
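
The Book File indexing can be sketched as follows, with the Book File modeled as a plain list of integers. The two-digit, zero-padded file-name pattern is inferred from the single example name given earlier and is an assumption, as are the function names.

```python
def file_count_slot(word_length):
    """Book File location holding the number of files in the series."""
    return 2 * word_length


def last_file_words_slot(word_length):
    """Book File location holding the word count of the last file in the series."""
    return 2 * word_length + 1


def file_name(word_length, file_number):
    """Assumed naming pattern, inferred from the example '...M03 ...N04'."""
    return f"...M{word_length:02d}", f"...N{file_number:02d}"


def last_file(book, word_length):
    """Name of the file currently being filled, and how many words it holds."""
    count = book[file_count_slot(word_length)]
    return file_name(word_length, count), book[last_file_words_slot(word_length)]


book = [0] * 20          # hypothetical Book File contents
book[file_count_slot(3)] = 4
book[last_file_words_slot(3)] = 17
print(last_file(book, 3))
```

With this arrangement, adding a word of length M requires reading only two Book File locations to find the file to append to.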

Operating  Phase  I 

To  operate  Phase  I,  all  Phase  I  programs  must  be  loaded  and 
started  from  a  CTSS  console.  The  files  CATDIR  FILE  and  CATREC 
FILE,  which  contain  the  document  records,  must  be  available.  The 
program  will  type  the  number  of  documents  analyzed  to  date,  then 
request  the  number  of  a  new  document  to  be  analyzed.  The  user  re¬ 
sponds  by  typing  a  four  digit  document  number  for  the  new  document 
to  be  analyzed.  The  computer  will  signify  when  the  analysis  is  com¬ 
plete  and  wait  for  another  document  number.  The  Phase  I  programs 
will  not  analyze  a  document  which  is  missing  a  title,  abstract,  or 
subject  index  field. 

Phase I automatically collects all data and builds the dictionary
files.


[Phase I program listings (AED source)]


PHASE II

The programs of Phase II are very similar to those of Phase I.

The structures of TA and SI are nearly identical to those of the same
arrays of Phase I. The dictionary used as a reference for Phase II
is just that generated in Phase I. The major differences in the
programming for Phase II are, first, that rather than place data into the
dictionary, Phase II extracts data from the dictionary and, second, that
Phase II must operate on the extracted data to form a pre-index.

Thus, changes have been made to the D array; also, two extra
arrays, P and PI, have been provided to help in generating pre-indexes.
The directory array, PTRS, for Phase II therefore contains
two additional entries. PTRS(14) contains a pointer to P; PTRS(15),
a pointer to PI.

The  D  Array 

The  D  array  is  still  used  to  store  the  usage  data  for  all  words  in 
TA. The usage data in D can then serve as an immediate reference
for  evaluation  of  pre -indexes.  However,  in  Phase  II  only  one  computer 
word  for  D  is  used  to  store  usage  information,  rather  than  the  three 
computer  words  used  in  Phase  I. 

The  P  Array 

The data extracted from the dictionary for all words of TA are placed
in array P. One computer word in P is used for each word of TA.
The  data  placed  in  P  are  the  usage  rate  for  the  word  and  the  frequency 
of  appearance  of  the  word. 

The  PI  Array 


The PI array is used for flags which mark those words selected
for a pre-index. The rightmost nine bits of a word in PI
are used to indicate selected words. Three bits are reserved for
each method in order to handle three parameter sets simultaneously.
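
A sketch of such a flag word follows, reading the nine flag bits as three per method, one for each parameter set. The text does not give the order of the bits within the nine, so the assignment used here (bit 0 rightmost, method I lowest, parameter set A lowest within each method) is hypothetical, as are the function names.

```python
BITS_PER_METHOD = 3  # one flag per parameter set (A, B, C)
METHODS = 3          # methods I, II, III


def flag_bit(method, param_set):
    """Bit for (method, parameter set); both indices count from 0.
    The ordering of bits is an assumption made for illustration."""
    if not (0 <= method < METHODS and 0 <= param_set < BITS_PER_METHOD):
        raise ValueError("method and param_set must each be 0, 1, or 2")
    return 1 << (method * BITS_PER_METHOD + param_set)


def mark_selected(pi_word, method, param_set):
    return pi_word | flag_bit(method, param_set)


def is_selected(pi_word, method, param_set):
    return bool(pi_word & flag_bit(method, param_set))


pi_word = 0
pi_word = mark_selected(pi_word, method=1, param_set=2)  # method II, set C
pi_word = mark_selected(pi_word, method=2, param_set=0)  # method III, set A
print(format(pi_word, "09b"))
print(is_selected(pi_word, 1, 2), is_selected(pi_word, 0, 0))
```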

The  PARA  Array 

One  further  set  of  data  is  necessary  to  operate  Phase  II.  Three 
sets  of  parameters  must  be  specified  for  each  pre-indexing  method. 
PARA  holds  18  parameters  in  total.  Method  I  requires  three  param¬ 
eters  for  three  trials;  method  II,  six;  and  method  III,  nine. 
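
The count of 18 follows from one, two, and three parameters per trial for methods I, II, and III respectively, each run for three trials. The sketch below fills in the threshold settings tabulated in Appendix B. The grouping into tuples and the storage order are assumptions, and the values shown are the thresholds themselves, not the two-digit integers typed by the user.

```python
# Threshold settings from Appendix B, one tuple per trial (A, B, C).
METHOD_I = [(0.25,), (0.50,), (0.75,)]                         # usage rate
METHOD_II = [(25, 0.50), (50, 0.50), (75, 0.50)]               # frequency %, usage rate
METHOD_III = [(75, 0.2, 0.5), (75, 0.3, 0.5), (75, 0.3, 0.6)]  # frequency %, zone low, zone high


def flatten(*methods):
    """Lay all parameter sets end to end, as a single PARA-like list."""
    return [value for method in methods for trial in method for value in trial]


para = flatten(METHOD_I, METHOD_II, METHOD_III)
print(len(para))
```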

Operation  of  Phase  II 

To  operate  Phase  II,  all  Phase  II  programs  must  be  loaded  and 
started  from  a  CTSS  console.  The  files  CATDIR  FILE  and  CATREC 
FILE  must  be  available.  The  program  will  request  the  user  to  input 
the  parameter  sets  for  each  method.  The  parameters  must  be 
specified as two-digit integers on a scale from 1 to 31. The program
converts the parameters to the appropriate usage-rate or frequency
thresholds.

Once the parameters have been typed in, the program will request
a document number. The user may type a four-digit number.
When the pre-indexes have been run, the program will print out a
summary evaluation and wait for the next document number.

To obtain a full listing of the pre-indexes for a document, the
user may type a 1 after the document number. The pre-indexing
system will print a table showing each word of the title and abstract,
and show whether the word has been included in a pre-index for each
method and each parameter set.


z  ►  ar  ui 

♦  O  —  t 

•+  <  <  o 

•  •  T  *“  O 


■  —  4  -  •  *-  U  « 

.  h-  e:  —  c  <*  a 

•  44  ►  -  •  4  •*  ' 

,  t  ►  ft  O 

y  vi  *  —  a  a  »  ^  I  l 
-  O 

r  it.  cj  ^  n  j  u  o  —  »• 

•  oO**-4«  —  4- 
140*00»40—  3 
UOUlOwUC^V 


O'-  O 

0  4*  -  • 

at  «r  X  * 

—  4  •  ILI 

O  ft  O 

•  •  -i  O  *• 

—  —  —  *4 

OMl  O  * 

W-  • 

ff*  -  *  a.  or 

—  v'  a  • 

—  at  ©  o  i4  • 

*> •■*  *  —  m  «* 

•  i  •  •  o 

*»  »«  K  O  M  • 

o  —  ••UI  »  ►-  4 

O  <4  *-  *  4  O  UJ 

N  ^  O  4  4  t/»  X 


»#  / 

KID  •  4  0  fM  3  o 

>>0«ZirwO 

4  4  **/  * 

44  (Nil  IOI  ■ 

4  at  ►  •  4  u  » i. 

44aZMWtU<*t 

at  at  i 


jr*)0Ul>wM*Ml904li 

Mitnu  wutuuuMW  a  * 

lu/xtritxi  i 


X  4  X 

*4  at  m  — 

4  N. 

0  4*0*  • 

O  *4  DU  4 

K  KM  M  *i 

—  «u  at 

a  4  4  u  a  u 

4  p  —  c 

S  —  4  x  at  at  X 

x  a  k  o  k 

a  •  *iw  a  •  a 

m.  —  a  4  a 


4  •  —  ill  4  •  4  u* 

a  ■»  a  a  4  * 

•  4  a  n  a  a 
ani«<  w  at  4 

a  4  a  a  o  at 


004  004  —  — 

U  ••40  •  •  4  u  — 

,4  •  •  4  4  ■  •  4  4  ►  M 

o*oo4u**ooa.±  *44 

sbm  cam  ■« 

^■-xxfto-xxa*--oo 

I  a 

>049 IWU*4KWU*M 


**  •  4 

—  4  • 
*-  *04 
4433 


a  -  a 

J  4  >/»  *-  - 

•  J  y  »  -• 

4  ►*•  4»  *“ 

j  a  a  4i 

j  «•  a  ^ 

*  N  4  o  a 

I  4  —  O  X 


I ►  l  I  O  THEN  PIGIN 

COME**  If  a  FEIlt  IS  MISSING  ?M€N  MAINT  FEltO  NOSE* 




COMMENT IF WORD IS NOT FOUND IN THE DICTIONARY FILE THEN SEARCH
FOR THE NEXT WORD OF THE SAME LENGTH $,

LCCFX  FTX2-PTF *♦«*  *, 

teA02-*M 2*»  i, 

IF  PI tf  G'O  PTENO  THEN  BEGIN 




BIBLIOGRAPHY 


1. The American University, Machine Indexing: Problems and
Progress (Papers presented at the Third Institute on Information
Storage and Retrieval, February 13-17, 1961), Washington,
D. C., 1962, 354 pp.

2. Baxendale, P. B., "An Empirical Model for Computer Indexing,"
in Machine Indexing, American U., 1962, pp. 207-218.

3. Bohnert, L. M., "New Role of Machines in Document Retrieval:
Definitions and Scope," in Machine Indexing, American U., 1962,
pp. 8-21.

4. Luhn, H. P. (ed), Automation and Scientific Communication,
Short Papers, Part 1, American Documentation Institute,
Washington, D. C., 1963, pp. 1-128.

5. Luhn, H. P., "Keyword-In-Context Index for Technical Literature
(KWIC Index)," in Amer. Documentation 11, 1960, pp. 288-295.

6. Maron, M. E., "Automatic Indexing: An Experimental Inquiry,"
Report No. P-2180, 1 September 1960 (rev. 2 Feb. 1961), p. 31.

7. McCormick, E. M., "Why Computers?" in Machine Indexing,
American U., 1962, pp. 220-232.

8. Overhage, C., Harman, R. J. (ed), INTREX: Planning Conference,
M.I.T. Press, Cambridge, Mass., 1965.

9. "Project INTREX: Semiannual Report," Massachusetts Institute
of Technology, 15 Sept. 1967.

10. "Project INTREX: Semiannual Report," Massachusetts Institute
of Technology, 15 March 1968.

11. Salton, G. (ed), "Information Storage and Retrieval," Scientific
Rept. No. ISR-11, Department of Computer Science, Cornell
University, Ithaca, New York, June, 1966.

12. Stevens, Mary E., "Automatic Indexing: A State-of-the-Art
Report," U. S. Department of Commerce, National Bureau of
Standards, NBS Monograph 91, 30 March 1965.




